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Abstract — A primary objective of the NASA Earth-Sun Explo- 
ration Technology Office is to understand the observed Earth cli- 
mate variability, thus enabling the determination and prediction of 
the climate’s response to both natural and human-induced forcing. 
We are currently developing a suite of computational tools that will 
allow researchers to calculate, from data, a variety of information- 
theoretic quantities such as mutual information, which can be used 
to identify relationships among climate variables, and transfer en- 
tropy, which indicates the possibility of causal interactions. Our 
tools estimate these quantities along with their associated error 
bars, the latter of which is critical for describing the degree of 
uncertainty in the estimates. This work is based upon optimal 
binning techniques that we have developed for piecewise-constant, 
histogram-style models of the underlying density functions. Two 
useful side benefits have already been discovered. The first allows 
a researcher to determine whether there exist sufficient data to es- 
timate the underlying probability density. The second permits one 
to determine an acceptable degree of round-off when compressing 
data for efficient transfer and storage. We also demonstrate how 
mutual information and transfer entropy can be applied so as to 
allow researchers not only to identify relations among climate vari- 
ables, but also to characterize and quantify their possible causal 
interactions. 


I. Introduction 

A primary objective of the NASA Earth-Sun Exploration 

Technology Office is to understand the observed Earth climate 

variability, and determine and predict the climate’s response to 

both natural and human-induced forcing. Central to this prob- 

lem is the concept of feedback and forcing. The basic idea is 

that changes in one climate subsystem will cause or force re- 
sponses in other subsystems. These responses in turn feed back 
to force other subsystems, and so on. While it is commonly as- 
sumed that these interactions can be described by linear systems 
techniques, one must appeal to large-scale averages, asymptotic 
distributions and central limi t theorems to defend such mod- 
els. In doing so, our ability to describe processes with rea- 
sonably high spatiotemporal resolution is lost in the averaging 
step. There are distinct advantages to developing feedback and 
forcing models that allow for nonlinearity. This is especially 


highlighted by the results of Lorenz’s work in modelling con- 
vection cells [1], which is used today as a textbook example 
of a nonlinear system, and historically was instrumental in the 
development of modem nonlinear dynamics. 

In the early stages of a field of science, much effort goes into 
identifying the relevant variables. This is typically a small set 
of variables that are used as parameters in idealized scientific 
models of the physical phenomenon under study. In Galileo’s 
time, he found that motion was best described by the relevant 
variables: displacement, velocity, and acceleration. Sometimes 
these scientific models are gross oversimplifications that merely 
capture the basic essence of a physical process, and sometimes 
they are highly detailed and allow one to make specific pre- 
dictions about the system. In Earth Science, the fact that the 
majority of our efforts are spent on amassing large amounts of 
data indicates that we have not yet identified the relevant vari- 
ables for many of the problems that we study. One of the aims 
of this work is to develop methods that will enable us to better 
identify relevant variables. 

A second aim of this work is to develop techniques that will 
allow us to identify relationships among these relevant vari- 
ables. As mentioned above, it is naive to expect that these vari- 
ables will interact linearly. Thus techniques that are sensitive to 
both linear and nonlinear relationships will better enable us to 
identify interactions among these variables. Information theory 
allows one to compute the amount of information that knowl- 
edge of one variable provides about another [2], [3]. Such com- 
putations are applicable to both linear and nonlinear relation- 
ships between the variables. Furthermore, they rely on higher- 
order statistics; whereas approaches such as correlation anal- 
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ponent Analysis (PCA), and Granger causality [4] are based 
on second-order statistics, which amount to approximating ev- 
erything with Gaussian distributions. An additional benefit is 
the fact that higher order generalizations of basic information- 
theoretic quantities are deeply connected to the concept of rel- 
evance [5], [6], and thus this approach is the natural methodol- 
ogy for identifying relevant variables and their interactions with 
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TTk = hkVk for the bin. This leads to the following piecewise- 
constant model h(x) of the nnknown probability density func- 
tion for the variable x 


M 


K X ) =^2 h k K(Zk-l,X,Xk), 


( 1 ) 


k=l 


where hk is the probability density of the k th bin with edges 
defined by Xk-i and Xk, and U(xk-i,x, Xk ) is the boxcar func- 
tion where 


n(x a ,x,x b ) 


0 if x < x a 

1 if x a < x < x b 
0 if x b < x 


( 2 ) 


Fig. 1. The optimal piecewise-constant probability density model generated 
from 1000 data samples drawn from a Gaussian density. The error bars indicate 
the uncertainty in the bin heights. It is superimposed over a 100-bin histogram 
that shows the irrelevant sampling variations of the data. 

one another. 

Information-theoretic computations ultimately rely on quan- 
tities such as entropy. While researchers have been estimating 
dntropy from data for years, relatively few attempts have been 
made to estimate the uncertainties associated with the entropy 
estimates. We consider this to be of paramount importance, 
since the degree to which we understand the Earth’s climate 
system can only be characterized by quantifying our uncertain- 
ties. The remainder of this paper will describe our ongoing 
efforts to estimate information-theoretic quantities from data as 
well as the associated uncertainties, and to demonstrate how 
these approaches will be used to identify relationships among 
relevant climate variables. 


For the case of equal bin widths, this density model can be re- 
written in terms of the bin probabilities itk as 

M M 

h(x) = — ir k n(a?fc_i, x, Xfc). (3) 

k=l 

where V is the width of the entire region covered by the den- 
sity model. This formalism is readily expanded into multiple 
dimensions by extending k to the status of a multi-dimensional 
index, and using Vk to represent the multi-dimensional volume, 
with V representing the multi-dimensional volume covered by 
the density model. 

To accurately describe the density function, we use the 
data to compute the optimal number of bins. This is per- 
formed by applying Bayesian probability theory [7], [8] and 
writing the posterior probability of the model parameters [9], 
which are the number of bins M and the bin probabilities 
7r = {7Ti, 7 t 2 , . . . , ttm— i}>' as a function of the N data points 
d = {di,d2, . . . , du} 


II. Density Models 

Our knowledge about a variable depends on what we know 
about the values that it can take. For instance, knowing that the 
average daytime summer beach water temperature in Hawaii is 
80°F provides some information. However, more information 
would be provided by the variance of this quantity. A complete 
quantification of our knowledge of this variable would be given 
by the probability density function. From that, one can com- 
pute the probability that the water temperature will fall within 
a given range. To apply these information-theoretic techniques, 
we first must estimate the probability density function from a 
data set. 
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where nk is the number of data points in the k th bin. Note 
that the symbol I is used to represent any prior information that 
we may have or any assumptions that we have made, such as 
assu min g that the bins are of equal width. 

Integrating over all possible bin heights gives the marginal 
posterior probability of the number of bins given the data [9] 


A. Piecewise-Constant Density Models 
We model the density function with a piecewise-constant 
model. Such a model divides the range of values of the variable 
into a set of M discrete bins and assigns a probability to each 
bin. We denote the probability that a data point is found to be in 
the k th bin by itk- The result is closely related to a histogram, 
except that the “height” of the bin hk, is the constant proba- 
bility density (bin probability divided by the bin width) over 
the region of the bin. Integrating this constant probability den- 
sity hk over the width of the bin Vk leads to a total probability 


p(M\d, /) oc 


M\ N r(f) n&gCnfc+j) 
vj r (i) M r (N + f) 


(5) 


where the T(-) is the Gamma function [10, p. 255]. The idea 
is to evaluate this posterior probability for all the values of the 
number of bins within a reasonable range and select the result 
with the greatest probability. In practice, it is often much easier 
computationally to search for the optimal number of bins M 


1 Note that there are only M — l bin probability parameters since there is a 
constraint that the sum should add to one. 
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by finding the value of M that maximizes the logarithm of the 
probability, (5) above. 

Using the joint posterior probability (4) one can compute the 
mean bin probabilities and the standard deviations from the data 
[9], The mean bin probability is 


/L \ 

— \hk) — 

Vk 


(M\ f n k + | \ 
\ V )\N+f)' 


( 6 ) 


and the associated variance of the height of the k th bin is 


2 _ (M\ 2 ( {n k + \){N-n k + ^) \ 

k \v)\ (N + f +i)(i\r + f ) 2 u> 

where the standard deviation is the square root of the variance. 
Note that bins with no counts still have a non-zero probability. 
No lack of evidence can ever prove conclusively that an event 
occurring in a given bin is impossible — just less probable. 

In this way we are able to estimate probability densities from 
data, and quantify the uncertainty in our knowledge. An exam- 
ple of a probability density model is shown in Figure 1. This 
optimal binning technique ensures that our density model in- 
cludes all the relevant information provided by the data while 
ignoring irrelevant details due to sampling variations. The re- 
sult is the most honest summary of our knowledge about the 
density function from the given data. Honest representations 
are important since they can reveal two potentially disastrous 
situations: insufficient data and excessive round-off error. 


B. Insufficient Data 

Without examining the uncertainties, one can never be sure 
that one has a sufficient amount of data to make an inference. 
How many data points does one need to estimate a density func- 
tion? Do we need 100 data points? 10000? a million? 

By examining the log posterior probability for the optimal 
number of bins given the data, one can easily detect whether 
one possesses sufficient data. 2 In this example (Figure 2), we 
see two density models constructed from data sampled from a 
Gaussian distribution. In the first case, we have collected 30 
data points, and in the second case, 1000. In the case of 30 data 
points, the behavior of the log posterior probability, which is the 
logarithm of (5), is very noisy with spurious ma xima . We can 
not be sure how many bins to use, and are thus very uncertain as 
to the shape of the density function from which these data were 
sampled. In the case with 1000 data points, the behavior of 
the log posterior is clear. It rises sharply as the number of pro- 
posed bins increases and reaches a peak and then gently falls 
off. The result is an estimate of the number of bins that pro- 
vides a piecewise-constant density model that, given the data, 
op timall y describes the true “unknown” Gaussian distribution- 

in our numerical experiments, we have found that for Gaus- 
sian distributed data, one needs approximately 75 to 100 data 
points to get a reasonable solution, and 150 to 200 data points 

2 The uncertainties, error bars or standard deviations are summary quantities 
that characterize the behavior of the posterior probability around the op timal 
solution. For this reason, rather than computing uncertainties, we simply look 
at the log posterior. 
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Fig. 2. A) With a small number of data points (in this case 30 data points 
sampled from a Gaussian), it is not possible to determine the probability density 
with any accuracy. B) The log posterior in this case jumps around with many 
spurious local maxima. This indicates that the inference is unreliable and more 
data are needed. C) With a sufficient amount of data (in this case 1000 data 
points), the probability density is easily estimated. D) The log posterior rises 
rapidly as the number of bins M increases reaching a peak (in this case at 
M = 14) and then falling off gently. E) hr this example, we take the previous 
1000 data points and rounded them off to the nearest 1/ 10th. As a result of this 
severe truncation, the optimal solution looks like a picket fence. The discrete 
nature of the truncated data is now a more prominent feature than the original 
probability density function. F) The log posterior detects this once the data 
have been separated into the discrete bins and can be separated no further. 


to be very certain. Note that this is a different situation than as- 
suming that we know that the underlying distribution is Gaus- 
sian and then trying to estimate the mean and variance. That 
is a very different problem where the prior knowledge that it is 
Gaussian (which would be represented by that I again) makes 
is feasible to make thejnference using significantly fewer data 
points. 


C. Excessive Round-Off 

The other problem that can occur is loss of information due 
to data compression or round-off. Many times to save memory 
space, data values are truncated to a small number of decimal 
places. When it is not clear how much information the data 
contains, it is not clear to what degree the data can be truncated 
before destroying valuable info rma tion. Our op timal binning 
technique is useful here as well. 

In the event that the data has been severely truncated, the 
optimal binning algorithm will see the discrete structure in the 
data due as being more meaningful than the overall shape of the 
underlying density function (Figure 2E). The result is that the 
optimal number of bins leads to what is called “picket fencing”, 
where the density model looks like a picket fence. There is no 








graceful way to recover from this — relevant data has been lost, 
and cannot be recovered. 

III. Entropy and Information 

We can characterize the behavior of a system X by looking 
at the set of states the system visits as it evolves in time. If a 
state is visited rarely, we would be surprised to find the system 
there. We can express the expectation (or lack of expectation) 
to find the system in state x in terms of the probability that it 
can be found in that state, p(x), by 

s ( x ) — log -7T- (8) 

p{x) 

This quantity is often called the surprise, since it is large for 
improbable events and small for probable ones. Averaging this 
quantity over all of the possible states of the system gives a 
measure of our expectation of the state of the system 

H(X)=^p(x)log-L-. (9) 

xex ' 

This quantity is called the Shannon Entropy, or entropy for 
short [2]. It can be thought of as a measure of the amount 
of information we possess about the system. It is usually ex- 
pressed by rewriting the fraction above using the properties of 
the logarithm 

tf(X) = -£>(x)logp(z). (10) 

xex 

Note that changing the base of the logarithm merely changes 
the units in which entropy is measured. When the logarithm 
base is 2, entropy is measured in bits, and when it is base e, it 
is measured in nats. 

If the system states can be described with multiple parame- 
ters, we can use them jointly to describe the state of the system. 
The entropy can still be computed by averaging over all possi- 
ble states. For two subsystems X and Y the joint entropy is 

H(X, Y) = — EE p(x,y) log p{x,y). (11) 

xeXy€Y 

The differences of entropies are useful quantities. Consider 
the difference between the joint entropy H ( X , Y) and the indi- 
vidual entropies H{X) and H ( Y ) 

MI(X, Y ) = H(X) + H(Y) - H(X, Y). (12) 

This quantity describes the difference in the amount of infor- 
mation one possesses when one considers the system jointly in- 
stead of considering the system as two individual subsystems. 
It is c all ed the Mutual Info nruition- since it describes the 

amount of information that is shared between the two subsys- 
tems. If you know something about subsystem X, the mu- 
tual information describes how much information you also pos- 
sess about Y, and vice versa. Thus MI quantifies the rele- 
vance of knowledge about one subsystem to knowledge about 
another subsystem. For this reason, it is useful for identify- 
ing and selecting a set of relevant variables that can aid in 


the prediction of another climate variable. One should note 
• that if two climate variables X and Y are independent, then 
H(X, Y) — H(X) + H(Y), then the mutual information (12) 
is zero — as one would expect. The mutual information is a 
measure of true statistical independence, whereas concepts like 
decorrelation only describe independence up to second-order. 
Two variables can be uncorrelated, yet still dependent. 3 

While the mutual information is an important quantity in 
identifying relationships between system variables, it provides 
no information regarding the causality of their interactions. The 
easiest way to see this is to note that the mutual information 
is symmetric with respect to interchange of X and Y, whereas 
causal interactions are not symmetric. To identify causal in- 
teractions, a asymmetric quantity must be utilized. Recently, 
Scbreiber [11] introduced a novel infonnation-theoretic quan- 
tity called the Transfer Entropy (TE). Consider two subsystems 
X and Y, with data in the form of two time series of measure- 
ments 

X “ {x\, X 2 , • • • , Xt, Xt+i, . . . , x n ~j 

Y = {yi,y2,---,y s ,ys+i,---,y n } 

with t = s + l where l is some lag time. The transfer entropy 
can be written as 

T(X t+1 \X U Y.) = I(X t+ 1, Y s ) - I(X t ,Xt+u Y s ) (13) 

where I(X t+1 ,T^) is the rank-2 co-information (mutual in- 
formation) and I(Xt,X t +i,Y s ) is the rank-3 co-information, 
which describes the information that all three variables share 
[12], [6]. Thus the transfer entropy is just the information 
shared by Y and future values of X minus the information 
shared by Y , X, and future values of X. In this way it captures 
the predictive information Y possesses about X and thus is an 
indicator of a possible causal interaction. Using the definitions 
of these higher-order informations, the TE can be re-written in 
the more convenient, albeit less intuitive form, originally sug- 
gested by Schreiber [11] 

T(X t+1 \X t ,Y s ) = (14) 

- H(X t ) + H(X t , Y s ) + H(X t ,X t+l ) - H{X u X t+1 , Y a ), 

where H(X t ,X t +i,Y s ) is the joint entropy between the sub- 
systems X, Y, and a time-shifted version of X, X t+ i. Unlike 
the mutual information, TE is not symmetric with interchange 
of X and Y 

T(Y s+1 \X t ,Y s ) = (15) 

- H{Y S ) + H(X t ,Y s ) + H(Y s ,Y 3+1 ) - H(X t , Y S ,Y S+1 ). 

This asymmetry is crucial since it is indicative of the ability of 
TE to identify causal interactions. 

This is the basic outline of the theory, the next section deals 
with the practical considerations of estimating these quantities 
from data and obtaining error bars to indicate the uncertainties 
in our estimates. 

3 This fact is usually poorly understood and it stems from the confusion be- 
tween the common meaning of the word ‘uncorrelated’, which we usually take 
to mean “independent”, and the precise mathematical definition of the word 
“uncorrelated”, which means that the covariance matrix is of diagonal form. 
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Entropy Estimate 

Fig. 3. This figure shows the histogram of the entropy samples computed 
from the bin probabilities drawn from a Dirichlet distribution defined by the 
data. The true entropy falls within one standard deviation of our estimate. 

IV. Estimating Entropy and Information 

Given a multi-dimensional data set, we begin by estimating 
the number of bins that will provide an optimal probability den- 
sity model. With this probability density model in hand, we can 
begin computing the information-theoretic quantities described 
above. The challenge is to propagate our uncertainties in our 
knowledge about the probability density to uncertainties in our 
knowledge about the entropy, mutual information, and transfer 
entropy estimates. 

Say that you have a variable that you know is Gaussian dis- 
tributed with zero mean and unit variance, 0, 1). If you 
want to obtain an instance of this variable that is in accordance 
with its known Gaussian probability density, you merely need 
to sample a point from a Gaussian distribution with zero mean 
and unit variance. It is easy to obtain many such instances by 
generating many samples, and it is not surprising to find that 
the mean and variance of those instances is consistent with the 
density from which they were sampled. 

We take the same approach here. Given the number of bins 
M in the probability density model, the posterior probability (4) 
of the bin heights has the form of a Dirichlet distribution. One 
can sample the bin heights from the Dirichlet distribution by 
sampling each bin height from a gamma distribution with com- 
mon scale and shape parameters and renormalizing the resulting 
set to unit probability [8, p. 482], Every set of bin height sam- 
ples that is drawn, constitutes a probability density model that 
could very well describe the given data. By taking something 
on the order of 50, 000 samples, we have a set of 50, 000 prob- 
ability density models each of which are probable descriptions 
of tiis datfl. The f ac t that we get many different, albeit similar, 
density models is a result of the fact that we are uncertain as 
to which model is correct. Without an infinite amount of data, 
we will always be uncertain — the question is: how uncertain? 
By simply computing the mean and variance of the bin heights 
from this set of samples, we can confirm that it approaches the 
theoretical mean (6), and likewise with the height variance (7). 
This sample variance, or its square root — the standard devi- 


ation, of the bin heights quantifies our uncertainty about the 
probability density. 

For each sampled probability density model, we' can compute 
the entropy. This will be given by (10) for a one-dimensional 
density function, by (1 1) for a two-dimensional density func- 
tion, and so on for higher dimensions. The result is a list of 
50 ; 000 or so entropies, from which we can readily compute the 
mean and standard deviation thus providing us with an entropy 
estimate and an associated standard deviation quantifying our 
uncertainty. 

In one experiment, 10, 000 data points were sampled from a 
Gaussian distribution with zero mean and unit variance. The 
optimal number of bins was found to be M = 24. The number 
of counts per bin for each of the 24 bins was used to sample 
50, 000 probability density models from a Dirichlet distribu- 
tion. From each of these samples, the entropy was computed. 
Figure 3 shows a histogram of the 50, 000 entropy samples. The 
mean entropy was found to be H est = 1.4231±0.007. The true 
entropy, which is H trU e — 1-1489, is within one standard de- 
viation of our estimate. This indicates that H est is a reasonable 
estimate of the entropy that simultaneously quantifies our un- 
certainty as to its precise value. 

The mutual information and transfer entropy are computed 
similarly, with the understanding that to compute the mutual in- 
formation, one works with two-dimensional density functions, 
and for the transfer entropy one works with three-dimensional 
densities. Despite the increase in dimensionality, the sam- 
pling procedure works exactly as described above for the one- 
dimensional case. 

V. Application to Climate Variables 

To demonstrated that the mutual information can identify re- 
lationships between climate variables, we performed several 
preliminary explorations. In one of our explorations, we con- 
sidered the percent cloud cover (computed as a monthly aver- 
age) as one subsystem X. These data were obtained from the 
International Satellite Cloud Climatology Project (ISCCP) cli- 
mate summary product C2 [13], [14], and consisted of monthly 
averages of percent cloud cover resulting in a time-series of 198 
months of 6596 equal-area pixels each with side length of 280 
km. It is best to think of the percent cloud cover at each pixel as 
an independent subsystem, say Xi , X 2 , • • • , X 6596 • The other 
subsystem Y was chosen to be the Cold Tongue Index (CTI), 



Fig. 4. This figure shows a pre liminar y mutual inf ormati on map, which quan- 
tifies the relationship between the Cold Tongue Index, which is indicative of the 
equatorial Pacific sea surface temperatures, and the Percent Cloud Cover across 
the globe. 
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Fig. 5. This figure shows the optimal histogram (error bars omitted) of the 
joint data formed by combining the Cold Tongue Index time series with the 
Percent Cloud Cover time series at the location where the mutual information 
was found to be maximal. 

which describes the sea surface temperature anomalies in the 
eastern equatorial Pacific Ocean (6N-6S, 180-90W)[15]. These 
anomalies are known to be indicative of the El Nino Southern 
Oscillation (ENSO)[16], [17]. Thus the second subsystem Y 
consists of the set of 198 monthly values of CTI, and corre- 
sponds in time to the cloud cover subsystems. 

The mutual information was computed between Xi and Y, 
and X% and Y, and so on by using (12). This enables us to 
generate a global map of 6596 mutual information calculations 
(Figure 4), which indicates the relationship between the Cold 
Tongue Index (CTI) and percent cloud cover across the globe. 
Note that the cloud cover affected by the sea surface tempera- 
ture (SST) variations lies mainly in the equatorial Pacific, along 
with an isolated area in Indonesia. The highlighted areas in the 
Indian longitudes are known artifacts of satellite coverage. 

Pixel 3231, which lies in the equatorial Pacific (1.25N 
191.25W), was found to have the greatest mutual information. 
Thus cloud cover at this point is maximally relevant to the CTI 
and vice versa. By taking the time series representing the per- 
cent cloud cover at this position, we can combine this with the 
CTI time series to construct an optimal two-dimensional den- 
sity model (Figure 5). This density function is not factorable 
into the product of two independent one-dimensional density 
functions. This indicates that the mutual information is non- 
zero (as we had previously determined), and that the two quan- 
tities are related in the sense that one variable provides infor- 
mation about the other. 

We are currently working to sample these density func- 
tions from their corresponding Diricblet distributions to obtain 
more accurate estimates of these information-theoretic quanti- 
ties along with error bars indicating the uncertainty in our esti- 
mates. The end result will be a set of software tools that will al- 
low researchers to rapidly and accurately estimate information- 
theoretic measures to identify, qualify and quantify causal in- 
teractions among climate variables from large climate data sets. 
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